Long-form factuality in large language models
Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact via a multi-step reasoning process that includes sending search queries to Google Search and determining whether each fact is supported by the search results. Furthermore, we propose extending the F1 score as an aggregated metric for long-form factuality.
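A minimal sketch of the aggregated metric, assuming the paper's F1@K formulation in which precision is the fraction of a response's rated facts that are supported and recall is measured against a hyperparameter K (the number of supported facts a user cares about); the function name and example values are illustrative:

def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """Aggregate long-form factuality as an extended F1 score.

    Precision: fraction of rated facts that are supported.
    Recall: supported facts relative to K, capped at 1.
    Illustrative sketch; K is a hyperparameter, not learned.
    """
    total = num_supported + num_not_supported
    if total == 0 or num_supported == 0:
        return 0.0
    precision = num_supported / total
    recall = min(num_supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)

# Example: a response with 42 supported and 8 unsupported facts, K = 64.
print(f1_at_k(42, 8, k=64))  # ~0.74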
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework that combines powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction.
A Differentiable Semantic Metric Approximation in Probabilistic Embedding for Cross-Modal Retrieval
Cross-modal retrieval aims to build correspondence between multiple modalities by learning a common representation space. Typically, an image can match multiple texts semantically, and vice versa, which significantly increases the difficulty of this task. To address this problem, probabilistic embedding has been proposed to quantify these many-to-many relationships. However, existing datasets (e.g., MS-COCO) and metrics (e.g., Recall@K) cannot fully capture these diverse correspondences due to non-exhaustive annotations. Based on this observation, we utilize the semantic correlation computed by CIDEr to find potential correspondences. We then present an effective metric, named Average Semantic Precision (ASP), which measures the ranking precision of semantic correlation for retrieval sets. Additionally, we introduce a novel and concise objective, coined Differentiable ASP Approximation (DAA).
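A hedged sketch of an average-precision-style score driven by semantic correlation rather than exhaustive labels; the thresholding rule and the inputs (e.g., CIDEr scores between captions as the relevance signal) are assumptions here, not the paper's exact ASP definition:

import numpy as np

def average_semantic_precision(sim, rel, threshold=0.35):
    """Average-precision-style score where relevance comes from semantic
    correlation (e.g., CIDEr between captions) instead of exhaustive
    annotations. sim: (num_gallery,) retrieval scores for one query;
    rel: (num_gallery,) semantic correlation to the query.
    NOTE: illustrative approximation, not the paper's exact metric.
    """
    order = np.argsort(-sim)               # rank gallery by retrieval score
    is_rel = rel[order] >= threshold       # semantically matched items
    if not is_rel.any():
        return 0.0
    hits = np.cumsum(is_rel)               # relevant items seen so far
    ranks = np.arange(1, len(sim) + 1)
    precisions = hits / ranks              # precision at each rank
    return float((precisions * is_rel).sum() / is_rel.sum())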
Accelerating Transformers with Spectrum-Preserving Token Merging
Increasing the throughput of the Transformer architecture, a foundational component used in numerous state-of-the-art models for vision and language tasks (e.g., GPT, LLaVA), is an important problem in machine learning. One recent and effective strategy is to merge token representations within Transformer models, aiming to reduce computational and memory requirements while maintaining accuracy. Prior works have proposed algorithms based on Bipartite Soft Matching (BSM), which divides tokens into distinct sets and merges the top-k most similar tokens. However, these methods have significant drawbacks, such as sensitivity to the token-splitting strategy and damage to informative tokens in later layers.
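For context, here is a minimal sketch of a ToMe-style BSM merging step under the usual alternating-split assumption; the averaging rule and example sizes are illustrative, not this paper's method:

import torch

def bsm_merge(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge r tokens via Bipartite Soft Matching (ToMe-style sketch).

    x: (N, C) token features. Tokens are split alternately into sets
    A and B; the r tokens in A most similar to something in B are
    averaged into their best-matching B token.
    """
    a, b = x[0::2], x[1::2]
    a_n = torch.nn.functional.normalize(a, dim=-1)
    b_n = torch.nn.functional.normalize(b, dim=-1)
    scores = a_n @ b_n.T                     # cosine similarity (|A|, |B|)
    best_val, best_idx = scores.max(dim=-1)  # best partner in B per A token
    order = best_val.argsort(descending=True)
    merged, kept = order[:r], order[r:]
    dst = b.clone()
    for s in merged.tolist():                # fold merged A tokens into B
        d = best_idx[s].item()
        dst[d] = (dst[d] + a[s]) / 2
    return torch.cat([a[kept], dst], dim=0)  # N - r tokens remain

# Example: halve 196 ViT tokens of width 768 by merging r = 98.
tokens = torch.randn(196, 768)
print(bsm_merge(tokens, r=98).shape)         # torch.Size([98, 768])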
Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias
Large language models (LLMs) have recently been leveraged as training data generators for various natural language processing (NLP) tasks. While previous research has explored different approaches to training models on generated data, these approaches generally rely on simple class-conditional prompts, which may limit the diversity of the generated data and inherit the systematic biases of the LLM. Thus, we investigate training data generation with diversely attributed prompts (e.g., specifying attributes like length and style), which have the potential to yield diverse and attributed generated data. Our investigation focuses on datasets with high cardinality and diverse domains, where we demonstrate that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance.
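A minimal sketch contrasting a simple class-conditional prompt with a diversely attributed one; the attribute dimensions and prompt wording below are hypothetical stand-ins, not the paper's actual configuration:

import itertools
import random

SIMPLE_PROMPT = "Write a news article about {label}."
ATTRIBUTED_PROMPT = (
    "Write a {length} news article about {label} in a {style} style, "
    "focusing on {subtopic}."
)

# Hypothetical attribute dimensions and values for illustration only.
ATTRIBUTES = {
    "length": ["short", "medium-length", "long"],
    "style": ["formal", "conversational", "analytical"],
    "subtopic": ["recent events", "historical background", "expert opinions"],
}

def attributed_prompts(label: str, n: int, seed: int = 0) -> list[str]:
    """Sample n distinct attribute combinations for one class label."""
    rng = random.Random(seed)
    combos = list(itertools.product(*ATTRIBUTES.values()))
    picks = rng.sample(combos, k=min(n, len(combos)))
    return [
        ATTRIBUTED_PROMPT.format(label=label, **dict(zip(ATTRIBUTES, combo)))
        for combo in picks
    ]

print(attributed_prompts("baseball", n=2))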
RealTime QA: What's the Answer Right Now?
Q: How many home runs has Shohei Ohtani hit?
Why was the dataset created? RealTime QA was created to provide a dynamic platform to benchmark question answering about the present time: it asks questions about the current world, challenging QA systems to provide answers that change in real time (e.g., the number of Shohei Ohtani's home runs). RealTime QA may also identify areas of potential research, such as improving how QA systems deal with unanswerable questions.
What are the instances?
PitcherNet helps researchers throw strikes with AI analysis
University of Waterloo researchers have developed new artificial intelligence (AI) technology that can accurately analyze pitcher performance and mechanics using low-resolution video of baseball games. Waterloo researchers convert video of a pitcher's performance into a two-dimensional model that PitcherNet's AI algorithm can then analyze. The system, developed for the Baltimore Orioles by the Waterloo team, plugs holes in much more elaborate and expensive technology already installed in most stadiums that host Major League Baseball (MLB), whose teams have increasingly tapped into data analytics in recent years. Those stadium systems, produced by a company called Hawk-Eye Innovations, use multiple special cameras in each park to catch players in action, but the data they yield is typically available only to the home team that owns the stadium where the games are played. To add away games to their analytics operation, as well as to use smartphone video taken by scouts at minor league and college games, the Orioles asked video and AI experts at Waterloo for help about three years ago.